home *** CD-ROM | disk | FTP | other *** search
- Path: mail2news.demon.co.uk!genesis.demon.co.uk
- From: Lawrence Kirby <fred@genesis.demon.co.uk>
- Newsgroups: comp.lang.c
- Subject: Re: Matrix Multiplication
- Date: Mon, 29 Jan 96 13:43:44 GMT
- Organization: none
- Message-ID: <822923024snz@genesis.demon.co.uk>
- References: <1996Jan22.110440@gamma.ntu.ac.sg> <4ecjv4$fn2@tamarack.cs.mtu.edu> <822792812snz@genesis.demon.co.uk> <DLwzv9.9wu@cwi.nl>
- Reply-To: fred@genesis.demon.co.uk
- X-NNTP-Posting-Host: genesis.demon.co.uk
- X-Newsreader: Demon Internet Simple News v1.27
- X-Mail2News-Path: genesis.demon.co.uk
-
- In article <DLwzv9.9wu@cwi.nl> dik@cwi.nl "Dik T. Winter" writes:
-
- > > >Consider optimizing for cache performance.
- >[Omitting the initialization]
- > > >for ( i = 0; i < N; i++ )
- > > > for ( j = 0; j < N; j++ )
- > > > for ( k = 0; k < N; k++ )
- > > > c[i][j] = c[i][j] + a[i][k]*b[k][j];
- > >
- > > Or how about:
- > > double sum = 0.0;
- >...
- > > c[i][j] = sum;
- > > You compiler might be able to perform that optimisation for itself but
- > > aliasing issues may make the analysis difficult.
- >
- >In this case aliasing might make it difficult indeed or even impossible,
- >and doing it is indeed an improvement. But the proposed version is *not*
- >optimal for caching; the optimal way to do it is to make the 'j' loop
- >innermost:
- > for ( i = 0; i < N; i++)
- > for (k = 0; k < N; k++)
- > for (j = 0; j < N; j++)
- > c[i][j] = c[i][j] + a[i][k]*b[k][j];
- >[and this is also optimal on vector processors].
- >The reason is that if at some stage c[i][j] (or b[k][j]) has a cache miss
- >a whole cache line is brought in, and this includes some elements of 'c' and
- >'b' that are needed on next steps through the loop. Putting k innermost
- >gives the benefit only to 'a'. (But it will also depend a lot on the
- >cache replacement strategy used by the processor.)
-
- Right. Since in my version c is only ever written to cache line reads
- are not necessarily an issue. It is also outside the innermost loop so
- (in your case where N=1000) a cache miss compared to a 1000 iteration
- loop (with 1000 potential cache misses for b) should be insignificant.
- Indeed the chance of being able to accumulate the result in the inner loop
- in a register is likely to be more significant.
-
- >Timings on an SGI with a 100 MHz MIPS R4k (in seconds with N = 1000, 10
- >multiplications):
- >i innermost 11.08
- >k innermost 5.92
- >k innermost + scalar 4.60
- >j innermost 2.59
-
- I guess you'd need to analyse the object code to see what's going on there.
-
- Still, this proves that where performance is critical there is no substitute
- for profiling the alternatives on the target system. It might be worth trying:
-
- /* zero all elements of c here */
-
- for (i = 0; i < N; i++) {
- for (k = 0; k < N; k++) {
- const double aik = a[i][k];
-
- for (j = 0; j < N; j++)
- c[i][j] += aik * b[k][j];
- }
- }
-
- since that is an optimisation the compiler definitely can't make for itself
- if it can't tell that a and c are distinct.
-
- Actually this makes it clear why your version works fast - *everything*
- in the inner loop caches well. The only disadvantage is that the c update
- requires an 'external' read-modify-write sequence. The performance of that
- will depend on how write caching is implemented.
-
- Maybe C could do as well as FORTRAN in this case.
-
- --
- -----------------------------------------
- Lawrence Kirby | fred@genesis.demon.co.uk
- Wilts, England | 70734.126@compuserve.com
- -----------------------------------------
-